Identification of embedded mathematical formulas in PDF documents using SVM
Abstract
With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the field of document analysis. In this paper, we present a method for identifying embedded mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text lines into words and then classifies each word into one of two classes: formula or ordinary text. Various features of embedded formulas, including geometric layout, character content, and context, are used to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
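The final step described in the abstract, merging words labeled as formulas into embedded formulas, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Word` type, the `merge_formula_words` function, and the sample line are hypothetical, and the `is_formula` labels are assumed to come from a word-level classifier such as the SVM the abstract mentions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A word segmented from a text line (hypothetical structure)."""
    text: str
    is_formula: bool  # assumed output of a word-level classifier


def merge_formula_words(words: List[Word]) -> List[str]:
    """Join each run of consecutive formula-labeled words into one formula."""
    formulas: List[str] = []
    run: List[str] = []
    for word in words:
        if word.is_formula:
            run.append(word.text)
        elif run:
            # An ordinary-text word ends the current formula run.
            formulas.append(" ".join(run))
            run = []
    if run:  # flush a run that reaches the end of the line
        formulas.append(" ".join(run))
    return formulas


# Hand-labeled sample line; a real system would predict these labels.
line = [
    Word("where", False),
    Word("x", True),
    Word("+", True),
    Word("y", True),
    Word("is", False),
    Word("small", False),
    Word("and", False),
    Word("z^2", True),
]
print(merge_formula_words(line))  # prints ['x + y', 'z^2']
```

Keeping the merge step separate from classification means the labels can come from any word-level classifier without changing how formulas are assembled.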
Similar resources
A Preprocessing and Analyzing Method of Images in PDF Documents for Mathematical Expression Retrieval
PDF documents are important information resources for a mathematical expression retrieval system. As a major component of PDF documents, image objects must first be converted to coded form with the help of character recognition and document analysis technology to allow content-based searching. Therefore, the quality of these images becomes the key factor which decides the correctness in th...
Mathematical Formulas Extraction
As a universal technical language, mathematics has been widely applied in many fields, and it is more accurate than any other language in describing information. Therefore, numerous mathematical formulas exist in all kinds of documents. There is no doubt that automatic processing of mathematical formulas is very important and necessary, of which extracting formulas from document images is the first ...
A New Approach for Recognizing Offline Handwritten Mathematical Symbols Using Character Geometry
Pattern recognition systems face several problems, such as feature extraction, identification, pre-processing, and classification. One of the application domains of pattern classification is handwritten character or symbol recognition. Identifying handwritten characters is always a complex and challenging task for researchers. Extensive research has been done on the ch...
dvi2svg: Using LaTeX layout on the Web
The problem of presenting mathematical formulas on the Web is non-trivial. Current systems offer only partial answers to such requirements as the guaranteed layout on the client side or the availability of font glyphs. We describe dvi2svg, a system to convert TeX's output into Scalable Vector Graphics. This approach responds to the requirements above and several others. We also present how it h...
Extracting Precise Data on the Mathematical Content of PDF Documents
As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognit...
Publication date: 2012